Motivation

Parametric vs non-parametric.

What if you had data that looked like this? It’s square, there’s clear edges that define the classes and it’s non-linear. It would be difficult to mathematically represent this data using a linear model like linear regression, logistic regression, glm, etc.





You could fit a logistic regression to this model.

mod <- glm(Y ~ X1 + X2, data = .data, family = 'binomial')
preds <- predict(mod, type = 'response') > 0.5





And if you know exactly what you were doing, you could add an interaction and get slightly better results.

mod <- glm(Y ~ X1 * X2, data = .data, family = 'binomial')
preds <- predict(mod, type = 'response') > 0.5





But knowing that functional form is difficult, especially in real-world high-dimensional datasets. Decision trees over

mod_tree <- rpart::rpart(Y ~ X1 + X2, data = .data)
preds <- predict(mod_tree) > 0.5





Decision trees

We will start at the lowest building block of the decision trees – the impurity metric – and build up from there.

  1. Define an impurity metric which drives each split in the decision tree
  2. Program a decision algorithm to choose the best data split based on the impurity measure
  3. Program a decision tree algorithm by recursively calling the decision algorithm

And then you can extend the tree model into more complex models like bagging, random forest, and XGBoost: 4. Program a bagging model by implementing many decision trees and resampling the data 5. Program a random forest model by implementing many decision trees, resampling the data, and sampling from the columns 6. XGB

Intuition

Binary decision trees create an interpretable decision-making framework for making a single prediction. Suppose a patient comes into your clinic with chest pain and you wish to diagnose them with either a heart attack or not a heart attack. A simple framework of coming to that diagnosis could look like the below diagram. Note that each split results in two outcomes (binary) and every possible condition leads to a terminal node.

The model’s splits can also be visualized as partitioning the feature space. Since the decision tree makes binary splits along a feature, the resulting boundaries will always be rectangular. Further growing of the above decision tree will result in more but smaller boxes. Additional features (X1, X2, ...) will result in additional dimensions to the plot.

But where to split the data? The splits are determined via an impurity index. With each split, the algorithm maximizes the purity of the resulting data. If a potential split results in classes [HA, HA] and [NHA, NHA] then that is chosen over another split that results in [HA, NHA] and [NHA, HA]. At each node, all possible splits are tested and the split that maximizes purity is chosen.

For classification problems, a commonly used metric is Gini impurity. Gini impurity is 2 * p * (1 - p) where p is the fraction of elements labeled as the class of interest. A value of 0 is a completely homogeneous vector while 0.5 is the inverse. The vector [NHA, HA, NHA] has a Gini value of 2 * 1/3 * 2/3 = 0.444. Since Gini is used for comparing splits, a Gini value is calculated per each resulting vector and then averaged – weighted by the respective lengths of the two vectors.

Making a split

The Gini impurity metric. Note that the output of gini is constrained to [0, 0.5].

gini <- function(p){
  2 * p * (1 - p)
}



For convenience, I am going to wrap the gini function so we feed it a vector instead of a probability. The probability is calculated from the mean value of the vector. In practice, this vector will be binary and represent classification labels so the mean value is the proportion of labels that represent a positive classification.

For convenience, I am going to wrap the gini function so we feed it a vector instead of a probability. The probability is calculated from the mean value of the vector. In practice, this vector will be binary and represent classification labels so the mean value is the proportion of labels that represent a positive classification.

gini_vector <- function(X){
  # X should be binary 0 1 or TRUE/FALSE
  gini(mean(X, na.rm = TRUE))
}
X1 <- c(0, 1, 0)
gini_vector(X1)
## [1] 0.4444444

And finally I am going to wrap it again so it gives us the weighted Gini of two vectors.

gini_weighted <- function(X1, X2){
  # X should be binary 0 1 or TRUE/FALSE
  if (is.null(X1)) return(gini_vector(X2))
  if (is.null(X2)) return(gini_vector(X1))
  
  prop_x1 <- length(X1) / (length(X1) + length(X2))
  weighted_gini <- (prop_x1*gini_vector(X1)) + ((1-prop_x1)*gini_vector(X2))
  return(weighted_gini)
}
X2 <- c(1, 1, 1)
gini_weighted(X1, X2)
## [1] 0.2222222

Splitting

At each node, the tree needs to make a decision using the Gini metric. Here a single-dimensional grid search is performed to find the optimal value of the split for a given feature such as X1.

The grid search needs to be expanded to search all possible features (X1, X2, ...). The resulting smallest Gini value is the split the tree uses.

best_feature_to_split <- function(X, Y){
  # X must be a dataframe, Y a vector of 0:1

  # get optimal split for each column
  ginis <- sapply(X, function(x) optimal_split(x, Y))
  
  # return the the column with best split and its splitting value
  best_gini <- min(unlist(ginis['gini',]))[1]
  best_column <- names(which.min(ginis['gini',]))[1]
  best_split <- ginis[['split_value', best_column]]
  pred <- ginis[['preds', best_column]]
  return(list(column = best_column, gini = best_gini, split = best_split, pred = pred))
}
n <- 1000
.data <- tibble(Y = rbinom(n, 1, prob = 0.3),
                X1 = rnorm(n),
                X2 = rnorm(n),
                X3 = rbinom(n, 1, prob = 0.5))
X <- .data[, -1]
Y <- .data[[1]]
best_feature_to_split(.data[, -1], .data[['Y']])
## $column
## [1] "X2"
## 
## $gini
## [1] 0.4374162
## 
## $split
## [1] 2.725928
## 
## $pred
## $pred$split0
## [1] 0.3249749
## 
## $pred$split1
## [1] 1

Recursion

To create the decision trees, the splitting algorithm should be applied until it reaches a certain stopping threshold. It is not known prior how many splits it is going to make – the depth or the width. This is not easily solved using a while loop as a split results in two new branches and each can potentially split again. Recursion is required.

In recursive functions, the function is called within itself until some stopping criteria is met. A simple example is the quicksort algorithm which sorts a vector of numbers from smallest to greatest.

Quicksort is a divide-and-conquer method that splits the input vector into two vectors based on a pivot point. Points smaller than the pivot go to one vector, points larger to the other vector. The pivot point can be any point but is often the first or last item in the vector. The function is called on itself to repeat the splitting until one or less numbers exist in the resulting vector. Then these sorted child-vectors are passed upward through the recursed functions and combined back into a single vector that is now sorted.

We’re going to implement the above splitting algorithm as a recursive function which builds our decision tree classifier. The tree will stop if it exceeds a certain depth, a minimum number of observations result from a given split, or if the Gini measure falls below a certain amount. Only one of these methods is required however including all three allow additional hyperparameter tuning down-the-road.

The function recursively calls the best_feature_to_split() function until one of the stopping criteria is met. All other code is to manage the saving of the split decisions. The output is a dataframe denoting these decisions.

Where trees struggle

Trees will struggle when the parameter space is dissected at an angle by the classification value. Since regression trees are partitioning the parameter space into rectangles, the tree will need to be deeper to approximate the decision boundary.

The below data’s classification is in two separate triangles: top left and bottom right of the plot. A logistic regression finds the boundary easily.





# decision tree
mod_tree <- rpart::rpart(Y ~ X1 + X2, data = .data, control = rpart::rpart.control(maxdepth = 2))
preds <- predict(mod_tree) > 0.5





# logistic regression
model_log <- glm(Y ~ X1 + X2, data = .data, family = 'binomial')
preds <- predict(model_log, type = 'response') > 0.5 





Bagging

Single decision trees are prone to overfitting and can have high variance on new data. A simple solution is to create many decision trees based on resamples of the data and allow each tree to “vote” on the final classification. This is bagging. The process keeps the low-bias of the single tree model but reduces overall variance.

The “vote” from each tree is their prediction for a given observation. The votes are averaged across all the trees and the final classification is determined from this average. The trees are trained on bootstrapped data – taking repeated samples of the training data with replacement.

Random forest

Random forest is like bagging except in addition to bootstrapping the observations, you also take a random subset of the features at each split. The rule-of-thumb sample size is the square root of the total number of features.

XGBoost

lorem ipsum

Model comparison on real world data

credit <- readr::read_csv('https://raw.githubusercontent.com/joemarlo/regression-trees/main/workshop/data/credit_card.csv')
# create train test split
X <- select(credit, -Class)
Y <- credit$Class
indices <- sample(c(TRUE, FALSE), size = nrow(credit), replace = TRUE, prob = c(0.5, 0.5))
X_train <- X[indices,]
X_test <- X[!indices,]
Y_train <- Y[indices]
Y_test <- Y[!indices]
# fit the bagged model
model_bag <- ipred::bagging(Class ~ ., data = credit[indices,])
preds <- predict(model_bag, newdata = credit[!indices,])
table(preds > 0.5, Y_test)
##        Y_test
##           0   1
##   FALSE 495  31
##   TRUE   13 217
# fit a random forest
model_ranger <- ranger::ranger(Class ~ ., data = credit[indices,], num.trees = 50, 
                               max.depth = 10, importance = 'impurity')
preds <- predict(model_ranger, data = X_test)$predictions
table(preds > 0.5, credit$Class[!indices])
##        
##           0   1
##   FALSE 496  29
##   TRUE   12 219
# fit an xgb
model_xgb <- xgboost::xgboost(data = as.matrix(X_train), label = Y[indices], objective = "binary:logistic",
                              max.depth = 2, eta = 1, nthread = 2, nrounds = 2)
## [1]  train-logloss:0.230453 
## [2]  train-logloss:0.157840
preds <- predict(model_xgb, newdata = as.matrix(X_test))
table(preds > 0.5, credit$Class[!indices])
##        
##           0   1
##   FALSE 496  36
##   TRUE   12 212

Feature importance

ranger::importance(model_ranger) %>% 
  tibble::enframe() %>% 
  ggplot(aes(x = reorder(name, -value), y = value)) +
  geom_col() +
  labs(title = 'Variables ranked by importance',
       x = NULL,
       y = 'Importance') +
  theme(axis.text.x = element_text(angle = -40, hjust = 0))

How to do this systematically

Enter tidymodels and tuning

# tidymodels
# set model form
credit_recipe <- recipes::recipe(Class ~ ., data = credit)

# include any pre-processing steps here: e.g. log transforms, normalizing, etc.

# specify basic model
mod_glm <- logistic_reg() %>% 
  set_engine('glm') %>% 
  set_mode('classification')

# example: fit one model
workflow() %>%
  add_formula(Class ~ .) %>% 
  add_model(mod_glm) %>% 
  fit(credit)
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## Class ~ .
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
## (Intercept)         Time           V1           V2           V3           V4  
##  -3.934e+00   -7.364e-06    3.201e-01    4.135e-01    7.920e-02    7.893e-01  
##          V5           V6           V7           V8           V9          V10  
##   4.577e-01   -3.781e-01   -3.278e-01   -3.068e-01   -2.600e-01   -4.859e-01  
##         V11          V12          V13          V14          V15          V16  
##   2.789e-01   -7.636e-01   -2.661e-01   -7.889e-01   -1.869e-01   -2.353e-01  
##         V17          V18          V19          V20          V21          V22  
##  -6.993e-02   -1.504e-01    1.652e-01   -8.150e-01    2.656e-02    7.026e-01  
##         V23          V24          V25          V26          V27          V28  
##   1.732e-01    2.222e-01    3.264e-01   -6.971e-01    3.174e-01    5.091e-01  
##      Amount  
##   5.817e-03  
## 
## Degrees of Freedom: 1491 Total (i.e. Null);  1461 Residual
## Null Deviance:       1892 
## Residual Deviance: 339.4     AIC: 401.4

Fit many models at once and evaluate them

# specify more models
mod_dt <- decision_tree(tree_depth = tune()) %>% 
  set_engine('rpart') %>% 
  set_mode('classification')
mod_rf <- rand_forest(trees = tune(),
                      mtry = tune(),
                      min_n = tune()) %>%
  set_engine('ranger', seed = 44) %>%
  set_mode("classification")
mod_xgb <- boost_tree(trees = tune(),
                      mtry = tune(),
                      tree_depth = tune()) %>%
  set_engine('xgboost', 
             seed = 44) %>% 
  set_mode("classification")

## fit all the models
# organize models into a workflow
credit_workflow <- workflow_set(
  list(basic = credit_recipe),
  models = list(glm = mod_glm,
                dt = mod_dt,
                rf = mod_rf,
                xgb = mod_xgb)
)

# cross validation
credit_split <- initial_split(credit, prop = 0.6)
credit_train <- training(credit_split)
credit_test <- testing(credit_split)
credit_folds <- vfold_cv(credit_train, v = 3)

# tune the models using a grid
credit_grid <- credit_workflow %>%
  workflow_map(
    'tune_grid',
    resamples = credit_folds,
    seed = 44,
    verbose = TRUE
  )

# look at the metrics
# purrr::map(credit_grid$result, ~show_best(.x, metric = 'accuracy'))
# purrr::map(credit_grid$result, ~select_best(.x, metric = 'accuracy'))

# look at the metrics
autoplot(credit_grid) + 
  labs(title = 'Cross-validation results', 
       y = NULL)








Bayesian Additive Regression Trees (BART)

Frequentist vs Bayesian primer

Two statistical methodologies: frequentist and Bayesian

Frequentist: - confidence interval - p-values - power - significance

Parameter is fixed Intervals are important but ultimately care about point-estimate

Bayesian: - credible intervals - priors - posteriors

Parameter is a random variable from some distribution The distribution and interval is the most important Things are “more likely” or “less likely”

Quickly fitting a bart model

mod_bart <- bart(trees = 100) %>%
  set_engine('dbarts') %>%
  set_mode("classification")

fit_bart <- workflow() %>%
  add_formula(Class ~ .) %>% 
  add_model(mod_bart) %>% 
  fit(credit[indices,])

preds <- predict(fit_bart, credit[!indices,])
table(credit$Class[!indices], abs(as.numeric(preds$.pred_class) - 2))
##    
##       0   1
##   0 501   7
##   1  31 217

BART for causal inference

bartCause

library(bartCause)
lalonde <- readr::read_csv('https://raw.githubusercontent.com/priism-center/thinkCausal_dev/master/thinkCausal/data/lalonde.csv')
# assess balance
confounders <- setdiff(colnames(lalonde), c('treat', 're78'))
plotBart::plot_balance(lalonde, 'treat', confounders)

# assess overlap
plotBart::plot_overlap_vars(lalonde, 'treat', confounders)

plotBart::plot_overlap_pScores(lalonde, 'treat', 're78', confounders)
## fitting treatment model via method 'bart'
## fitting response model via method 'bart'

# run model    
bart_ate <- bartCause::bartc(
  response = lalonde$re78,
  treatment = lalonde$treat,
  confounders = lalonde[, confounders],
  estimand = 'ate',
  commonSup.rule = 'sd'
)
## fitting treatment model via method 'bart'
## fitting response model via method 'bart'

Assessing fit

plotBart::plot_trace(bart_ate)

Results and interpration

mean(bartCause::extract(bart_ate, "cate"))
## [1] 1584.514
plotBart::plot_CATE(bart_ate, type = 'density', 
                    ci_80 = TRUE, ci_95 = TRUE, .mean = TRUE)

For participants in this study, receiving the treatment condition led to an increase of ~1600 units compared to what would have happened if participants did not receive the treatment condition.

Final thoughts

Benefits of tree methods:

Downsides: